Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate.
In this tutorial we are going to explore how to align short reads to a reference genome. Two of the most popular short read aligners used in bioinformatics are the Bowtie 2 and BWA aligners. Both of these aligners use the Burrows-Wheeler transform to build an index of the reference genome, however they differ from each other in respect of the alignment algorithm. It is important to say up front the alignment problem has not been solved - researchers continue to working on optimizing the efficiency of the read alignment procedure and strategies for determining the position of multi-mapping reads. With that said, lets get to the tutorial. The first thing to do is create a directory to store all the tutorial output files and data.
Create a working directory for this tutorial:
mkdir tutorial
Create a new environment called ‘alignment’ with all the required software installed:
conda create --yes --name alignment bowtie2 bwa entrez-direct samtools seqkit sra-tools=2.11.0=pl5262h37d2149_1 # only this version works currently (21/09/2022)
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
## Collecting package metadata (repodata.json): ...working... done
## Solving environment: ...working... done
##
## ## Package Plan ##
##
## environment location: /opt/miniconda3/envs/alignment
##
## added / updated specs:
## - bowtie2
## - bwa
## - entrez-direct
## - samtools
## - seqkit
## - sra-tools==2.11.0=pl5262h37d2149_1
##
##
## The following packages will be downloaded:
##
## package | build
## ---------------------------|-----------------
## tbb-2021.5.0 | hb8565cd_5 154 KB conda-forge
## ------------------------------------------------------------
## Total: 154 KB
##
## The following NEW packages will be INSTALLED:
##
## bowtie2 bioconda/osx-64::bowtie2-2.4.5-py310h7d4de36_4
## bwa bioconda/osx-64::bwa-0.7.17-h1f540d2_9
## bzip2 conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
## c-ares conda-forge/osx-64::c-ares-1.18.1-h0d85af4_0
## ca-certificates conda-forge/osx-64::ca-certificates-2022.9.14-h033912b_0
## curl conda-forge/osx-64::curl-7.83.1-h23f1065_0
## entrez-direct bioconda/osx-64::entrez-direct-16.2-h193322a_1
## gettext conda-forge/osx-64::gettext-0.19.8.1-hd1a6beb_1008
## hdf5 conda-forge/osx-64::hdf5-1.10.6-nompi_haae91d6_101
## htslib bioconda/osx-64::htslib-1.16-h567f53e_0
## icu conda-forge/osx-64::icu-70.1-h96cf925_0
## krb5 conda-forge/osx-64::krb5-1.19.3-hb98e516_0
## libcurl conda-forge/osx-64::libcurl-7.83.1-h23f1065_0
## libcxx conda-forge/osx-64::libcxx-14.0.6-hccf4f1f_0
## libdeflate conda-forge/osx-64::libdeflate-1.13-h775f41a_0
## libedit conda-forge/osx-64::libedit-3.1.20191231-h0678c8f_2
## libev conda-forge/osx-64::libev-4.33-haf1e3a3_1
## libffi conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
## libgfortran conda-forge/osx-64::libgfortran-4.0.0-7_5_0_h1a10cd1_23
## libgfortran4 conda-forge/osx-64::libgfortran4-7.5.0-h1a10cd1_23
## libiconv conda-forge/osx-64::libiconv-1.16-haf1e3a3_0
## libidn2 conda-forge/osx-64::libidn2-2.3.3-hac89ed1_0
## libnghttp2 conda-forge/osx-64::libnghttp2-1.47.0-h5aae05b_1
## libsqlite conda-forge/osx-64::libsqlite-3.39.3-ha978bb4_0
## libssh2 conda-forge/osx-64::libssh2-1.10.0-h47af595_3
## libunistring conda-forge/osx-64::libunistring-0.9.10-h0d85af4_0
## libxml2 conda-forge/osx-64::libxml2-2.9.14-hea49891_4
## libzlib conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
## llvm-openmp conda-forge/osx-64::llvm-openmp-14.0.4-ha654fa7_0
## ncbi-ngs-sdk bioconda/osx-64::ncbi-ngs-sdk-2.11.2-h247ad82_0
## ncurses conda-forge/osx-64::ncurses-6.3-h96cf925_1
## openssl conda-forge/osx-64::openssl-3.0.5-hfd90126_2
## ossuuid conda-forge/osx-64::ossuuid-1.6.2-h0a44026_1000
## perl conda-forge/osx-64::perl-5.26.2-hbcb3906_1008
## perl-uri bioconda/osx-64::perl-uri-1.71-pl526_3
## perl-xml-libxml bioconda/osx-64::perl-xml-libxml-2.0132-pl526h08abf6f_1
## perl-xml-namespac~ bioconda/osx-64::perl-xml-namespacesupport-1.11-pl526_1
## perl-xml-sax bioconda/osx-64::perl-xml-sax-0.99-pl526_1
## perl-xml-sax-base bioconda/osx-64::perl-xml-sax-base-1.09-pl526_0
## pip conda-forge/noarch::pip-22.2.2-pyhd8ed1ab_0
## python conda-forge/osx-64::python-3.10.6-hc14f532_0_cpython
## python_abi conda-forge/osx-64::python_abi-3.10-2_cp310
## readline conda-forge/osx-64::readline-8.1.2-h3899abd_0
## samtools bioconda/osx-64::samtools-1.15.1-h7e39424_1
## seqkit bioconda/osx-64::seqkit-2.3.1-h527b516_0
## setuptools conda-forge/noarch::setuptools-65.3.0-pyhd8ed1ab_1
## sra-tools bioconda/osx-64::sra-tools-2.11.0-pl5262h37d2149_1
## tbb conda-forge/osx-64::tbb-2021.5.0-hb8565cd_5
## tk conda-forge/osx-64::tk-8.6.12-h5dbffcc_0
## tzdata conda-forge/noarch::tzdata-2022c-h191b570_0
## wget conda-forge/osx-64::wget-1.20.3-hd3787cc_1
## wheel conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
## xz conda-forge/osx-64::xz-5.2.6-h775f41a_0
## zlib conda-forge/osx-64::zlib-1.2.12-hfd90126_3
## zstd conda-forge/osx-64::zstd-1.5.2-hfa58983_4
##
##
##
## Downloading and Extracting Packages
##
tbb-2021.5.0 | 154 KB | | 0%
tbb-2021.5.0 | 154 KB | # | 10%
tbb-2021.5.0 | 154 KB | ####1 | 41%
tbb-2021.5.0 | 154 KB | ######2 | 62%
tbb-2021.5.0 | 154 KB | ########## | 100%
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## # $ conda activate alignment
## #
## # To deactivate an active environment, use
## #
## # $ conda deactivate
##
## Retrieving notices: ...working... done
Activate the ‘alignment’ environment:
conda activate alignment
Test that Bowtie 2 and BWA are available:
which bowtie2
which bwa
## /opt/miniconda3/envs/alignment/bin/bowtie2
## /opt/miniconda3/envs/alignment/bin/bwa
The data we are going to use in this tutorial is from the original research paper which first reported the genome sequence of the SARS-CoV-2 virus responsible for the COVID-19 pandemic across the globe. The paper is listed below for you to read in your own time, if you are so inclined:
A new coronavirus associated with human respiratory disease in China
Within the paper there is a ‘Data availability’ section telling you where to download the sequencing data. To save you some time, I can tell you that the sequencing data was deposited in the SRA database under the study accession number SRP245409. Have a look through the study page to get familiar with the data, ask yourself questions like ‘how many runs are in this study?’, ‘what sequencing machine was used?’, ‘how many reads did they sequence?’, etc.
As well as the sequencing data, we also need to download a copy of the SARS-CoV-2 reference genome. The NCBI hosts the reference genome in the GenBank database under the accession number MN908947. Again, get yourself acquainted with the data by exploring the GenBank page for the genome, ask yourself questions like ‘who submitted the genome sequence?’, ‘how long is the genome?’, ‘how many genes do it contain?’, etc.
Now that you’ve had a good rummage around, lets get to business and download the data. First, create a directory to actually store the sequencing data and reference genome.
mkdir tutorial/data
Although we discussed the advantages of short read aligners in the context of large genomes. This tutorial would take much longer if we went down that route. Instead, small viral genomes such as SARS-CoV-2 are an excellent substitute for teaching. Use the efetch command from EDirect to download the reference genome:
efetch -db nuccore -id MN908947 -format fasta > tutorial/data/MN908947.fasta
To save time and memory requirements, we are only going to use a small sample of the sequencing data. Use the fastq-dump command from the SRA Toolkit to download just 1,000 of the paired-end reads from the SARS-CoV-2 sequencing data:
fastq-dump -X 1000 -O tutorial/data --split-files SRR15168839
## Read 1000 spots for SRR15168839
## Written 1000 spots for SRR15168839
Now that we have the reads and the reference genome, lets start aligning!
The first aligner we’re going to use is Bowtie 2 - the second major version of the Bowtie alignment software. The general procedure for read alignment is to first index the reference genome sequence, then align the reads to the genome sequence with the help of the index to speed up mapping. I should warn you now, aligners have a huge number of settings. On the one hand this is great because it means they can be customized for very specific use cases, on the other hand all those settings can be quite bewildering for someone inexperienced. Luckily for you, unless you have a particularly abnormal data set - the defaults work quite well! In fact, Bowtie 2 comes with has handy presets you can choose to automatically set multiple parameters, they even have helpful names like ‘–very-fast’ or ‘–very-sensitive’.
Lets begin by creating a directory to store all the Bowtie 2 output files:
mkdir tutorial/bowtie2
To index a reference sequence, you need to use the bowtie2-build command:
bowtie2-build tutorial/data/MN908947.fasta tutorial/bowtie2/MN908947
## Settings:
## Output files: "tutorial/bowtie2/MN908947.*.bt2"
## Line rate: 6 (line is 64 bytes)
## Lines per side: 1 (side is 64 bytes)
## Offset rate: 4 (one in 16)
## FTable chars: 10
## Strings: unpacked
## Max bucket size: default
## Max bucket size, sqrt multiplier: default
## Max bucket size, len divisor: 4
## Difference-cover sample period: 1024
## Endianness: little
## Actual local endianness: little
## Sanity checking: disabled
## Assertions: disabled
## Random seed: 0
## Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
## tutorial/data/MN908947.fasta
## Building a SMALL index
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 7475
## Using parameters --bmax 5607 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 5607 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 3737 (target: 5606)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
## Reserving size (5607) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 4574 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 4575 for bucket 1
## Getting block 2 of 8
## Reserving size (5607) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 2668 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 2669 for bucket 2
## Getting block 3 of 8
## Reserving size (5607) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 3483 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 3484 for bucket 3
## Getting block 4 of 8
## Reserving size (5607) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 4420 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 4421 for bucket 4
## Getting block 5 of 8
## Reserving size (5607) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 4901 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 4902 for bucket 5
## Getting block 6 of 8
## Reserving size (5607) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 3813 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 3814 for bucket 6
## Getting block 7 of 8
## Reserving size (5607) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 2667 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 2668 for bucket 7
## Getting block 8 of 8
## Reserving size (5607) for bucket 8
## Calculating Z arrays for bucket 8
## Entering block accumulator loop for bucket 8:
## bucket 8: 10%
## bucket 8: 20%
## bucket 8: 30%
## bucket 8: 40%
## bucket 8: 50%
## bucket 8: 60%
## bucket 8: 70%
## bucket 8: 80%
## bucket 8: 90%
## bucket 8: 100%
## Sorting block of length 3370 for bucket 8
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 3371 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 8954
## fchr[G]: 14446
## fchr[T]: 20309
## fchr[$]: 29903
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4204544 bytes to primary EBWT file: tutorial/bowtie2/MN908947.1.bt2.tmp
## Wrote 7480 bytes to secondary EBWT file: tutorial/bowtie2/MN908947.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 29903
## bwtLen: 29904
## sz: 7476
## bwtSz: 7476
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 1869
## offsSz: 7476
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 156
## numLines: 156
## ebwtTotLen: 9984
## ebwtTotSz: 9984
## color: 0
## reverse: 0
## Total time for call to driver() for forward index: 00:00:00
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 7475
## Using parameters --bmax 5607 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 5607 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 6; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 3737 (target: 5606)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
## Reserving size (5607) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 2598 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 2599 for bucket 1
## Getting block 2 of 8
## Reserving size (5607) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 5142 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 5143 for bucket 2
## Getting block 3 of 8
## Reserving size (5607) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 4371 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 4372 for bucket 3
## Getting block 4 of 8
## Reserving size (5607) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 4433 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 4434 for bucket 4
## Getting block 5 of 8
## Reserving size (5607) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 3267 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 3268 for bucket 5
## Getting block 6 of 8
## Reserving size (5607) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 2880 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 2881 for bucket 6
## Getting block 7 of 8
## Reserving size (5607) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 5175 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 5176 for bucket 7
## Getting block 8 of 8
## Reserving size (5607) for bucket 8
## Calculating Z arrays for bucket 8
## Entering block accumulator loop for bucket 8:
## bucket 8: 10%
## bucket 8: 20%
## bucket 8: 30%
## bucket 8: 40%
## bucket 8: 50%
## bucket 8: 60%
## bucket 8: 70%
## bucket 8: 80%
## bucket 8: 90%
## bucket 8: 100%
## Sorting block of length 2030 for bucket 8
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 2031 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 8954
## fchr[G]: 14446
## fchr[T]: 20309
## fchr[$]: 29903
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4204544 bytes to primary EBWT file: tutorial/bowtie2/MN908947.rev.1.bt2.tmp
## Wrote 7480 bytes to secondary EBWT file: tutorial/bowtie2/MN908947.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 29903
## bwtLen: 29904
## sz: 7476
## bwtSz: 7476
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 1869
## offsSz: 7476
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 156
## numLines: 156
## ebwtTotLen: 9984
## ebwtTotSz: 9984
## color: 0
## reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:00
## Renaming tutorial/bowtie2/MN908947.3.bt2.tmp to tutorial/bowtie2/MN908947.3.bt2
## Renaming tutorial/bowtie2/MN908947.4.bt2.tmp to tutorial/bowtie2/MN908947.4.bt2
## Renaming tutorial/bowtie2/MN908947.1.bt2.tmp to tutorial/bowtie2/MN908947.1.bt2
## Renaming tutorial/bowtie2/MN908947.2.bt2.tmp to tutorial/bowtie2/MN908947.2.bt2
## Renaming tutorial/bowtie2/MN908947.rev.1.bt2.tmp to tutorial/bowtie2/MN908947.rev.1.bt2
## Renaming tutorial/bowtie2/MN908947.rev.2.bt2.tmp to tutorial/bowtie2/MN908947.rev.2.bt2
The command should print many lines of output then quit. When the command completes, the tutorial/bowtie2 directory will contain six new files that all start with MN908947 and end with .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files constitute the index for the Bowtie 2 aligner. The next step is to align the reads. Remember what I said about the aligner having lots of settings? Now would be a good time to have a quick look at all those settings. I know you’re excited to get to the alignment step, but understanding the capabilities of the aligner is important. Go ahead and display the usage information for the bowtie2 command:
bowtie2 -h
## Bowtie 2 version 2.4.5 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
## Usage:
## bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]
##
## <bt2-idx> Index filename prefix (minus trailing .X.bt2).
## NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
## <m1> Files with #1 mates, paired with files in <m2>.
## Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
## <m2> Files with #2 mates, paired with files in <m1>.
## Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
## <r> Files with unpaired reads.
## Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
## <i> Files with interleaved paired-end FASTQ/FASTA reads
## Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
## <bam> Files are unaligned BAM sorted by read name.
## <sam> File for SAM output (default: stdout)
##
## <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
## specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
##
## Options (defaults in parentheses):
##
## Input:
## -q query input files are FASTQ .fq/.fastq (default)
## --tab5 query input files are TAB5 .tab5
## --tab6 query input files are TAB6 .tab6
## --qseq query input files are in Illumina's qseq format
## -f query input files are (multi-)FASTA .fa/.mfa
## -r query input files are raw one-sequence-per-line
## -F k:<int>,i:<int> query input files are continuous FASTA where reads
## are substrings (k-mers) extracted from a FASTA file <s>
## and aligned at offsets 1, 1+i, 1+2i ... end of reference
## -c <m1>, <m2>, <r> are sequences themselves, not files
## -s/--skip <int> skip the first <int> reads/pairs in the input (none)
## -u/--upto <int> stop after first <int> reads/pairs (no limit)
## -5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
## -3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
## --trim-to [3:|5:]<int> trim reads exceeding <int> bases from either 3' or 5' end
## If the read end is not specified then it defaults to 3 (0)
## --phred33 qualities are Phred+33 (default)
## --phred64 qualities are Phred+64
## --int-quals qualities encoded as space-delimited integers
##
## Presets: Same as:
## For --end-to-end:
## --very-fast -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
## --fast -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
## --sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
## --very-sensitive -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
##
## For --local:
## --very-fast-local -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
## --fast-local -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
## --sensitive-local -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)
## --very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
##
## Alignment:
## -N <int> max # mismatches in seed alignment; can be 0 or 1 (0)
## -L <int> length of seed substrings; must be >3, <32 (22)
## -i <func> interval between seed substrings w/r/t read len (S,1,1.15)
## --n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
## --dpad <int> include <int> extra ref chars on sides of DP table (15)
## --gbar <int> disallow gaps within <int> nucs of read extremes (4)
## --ignore-quals treat all quality values as 30 on Phred scale (off)
## --nofw do not align forward (original) version of read (off)
## --norc do not align reverse-complement version of read (off)
## --no-1mm-upfront do not allow 1 mismatch alignments before attempting to
## scan for the optimal seeded alignments
## --end-to-end entire read must align; no clipping (on)
## OR
## --local local alignment; ends might be soft clipped (off)
##
## Scoring:
## --ma <int> match bonus (0 for --end-to-end, 2 for --local)
## --mp <int> max penalty for mismatch; lower qual = lower penalty (6)
## --np <int> penalty for non-A/C/G/Ts in read/ref (1)
## --rdg <int>,<int> read gap open, extend penalties (5,3)
## --rfg <int>,<int> reference gap open, extend penalties (5,3)
## --score-min <func> min acceptable alignment score w/r/t read length
## (G,20,8 for local, L,-0.6,-0.6 for end-to-end)
##
## Reporting:
## (default) look for multiple alignments, report best, with MAPQ
## OR
## -k <int> report up to <int> alns per read; MAPQ not meaningful
## OR
## -a/--all report all alignments; very slow, MAPQ not meaningful
##
## Effort:
## -D <int> give up extending after <int> failed extends in a row (15)
## -R <int> for reads w/ repetitive seeds, try <int> sets of seeds (2)
##
## Paired-end:
## -I/--minins <int> minimum fragment length (0)
## -X/--maxins <int> maximum fragment length (500)
## --fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
## --no-mixed suppress unpaired alignments for paired reads
## --no-discordant suppress discordant alignments for paired reads
## --dovetail concordant when mates extend past each other
## --no-contain not concordant when one mate alignment contains other
## --no-overlap not concordant when mates overlap at all
##
## BAM:
## --align-paired-reads
## Bowtie2 will, by default, attempt to align unpaired BAM reads.
## Use this option to align paired-end reads instead.
## --preserve-tags Preserve tags from the original BAM record by
## appending them to the end of the corresponding SAM output.
##
## Output:
## -t/--time print wall-clock time taken by search phases
## --un <path> write unpaired reads that didn't align to <path>
## --al <path> write unpaired reads that aligned at least once to <path>
## --un-conc <path> write pairs that didn't align concordantly to <path>
## --al-conc <path> write pairs that aligned concordantly at least once to <path>
## (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
## --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
## --quiet print nothing to stderr except serious errors
## --met-file <path> send metrics to file at <path> (off)
## --met-stderr send metrics to stderr (off)
## --met <int> report internal counters & metrics every <int> secs (1)
## --no-unal suppress SAM records for unaligned reads
## --no-head suppress header lines, i.e. lines starting with @
## --no-sq suppress @SQ header lines
## --rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field
## --rg <text> add <text> ("lab:value") to @RG line of SAM header.
## Note: @RG line only printed when --rg-id is set.
## --omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.
## --sam-no-qname-trunc
## Suppress standard behavior of truncating readname at first whitespace
## at the expense of generating non-standard SAM.
## --xeq Use '='/'X', instead of 'M,' to specify matches/mismatches in SAM record.
## --soft-clipped-unmapped-tlen
## Exclude soft-clipped bases when reporting TLEN
## --sam-append-comment
## Append FASTA/FASTQ comment to SAM record
##
## Performance:
## -p/--threads <int> number of alignment threads to launch (1)
## --reorder force SAM output order to match order of input reads
## --mm use memory-mapped I/O for index; many 'bowtie's can share
##
## Other:
## --qc-filter filter out reads that are bad according to QSEQ filter
## --seed <int> seed for random number generator (0)
## --non-deterministic
## seed rand. gen. arbitrarily instead of using read attributes
## --version print version information and quit
## -h/--help print this usage message
Now that you’ve skimmed the usage information, lets run the aligner! We need to provide the index of the reference genome, the FASTQ files containing the sequencing reads, and also the name of file to write the results:
bowtie2 -x tutorial/bowtie2/MN908947 -1 tutorial/data/SRR15168839_1.fastq -2 tutorial/data/SRR15168839_2.fastq > tutorial/bowtie2/SRR15168839.sam
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 910 (91.00%) aligned concordantly 0 times
## 90 (9.00%) aligned concordantly exactly 1 time
## 0 (0.00%) aligned concordantly >1 times
## ----
## 910 pairs aligned concordantly 0 times; of these:
## 870 (95.60%) aligned discordantly 1 time
## ----
## 40 pairs aligned 0 times concordantly or discordantly; of these:
## 80 mates make up the pairs; of these:
## 79 (98.75%) aligned 0 times
## 1 (1.25%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 96.05% overall alignment rate
This runs the Bowtie 2 aligner, which aligns a set of paired-end reads to the SARS-CoV-2 reference genome using the index generated in the previous step. The alignment results are written in SAM format to the file tutorial/bowtie2/SRR15168839.sam, and a short alignment summary is written to the console.
The summary provides some useful information on the mapping rate. In the example above, it tells you that of the 1000 paired-end reads, only 9% of them aligned ‘concordantly’. This term is used to label read pairs which align with the expected insert size in the expected orientation. For more details, see the section of the manual which describes the difference between concordant and discordant pairs. Of those reads which aligned ‘discordantly’, nearly 96% of them were aligned. Finally, the overall alignment rate for this data is 96%.
The SAM file format is the de-facto standard for reporting of alignments to a reference genome. These files can be manipulated using the SAMtools software package. We will learn more about SAMtools in the next workshop, so for now just follow the commands listed below.
Print SAM header only:
samtools view -H tutorial/bowtie2/SRR15168839.sam
## @HD VN:1.5 SO:unsorted GO:query
## @SQ SN:MN908947.3 LN:29903
## @PG ID:bowtie2 PN:bowtie2 VN:2.4.5 CL:"/opt/miniconda3/envs/alignment/bin/bowtie2-align-s --wrapper basic-0 -x tutorial/bowtie2/MN908947 -1 tutorial/data/SRR15168839_1.fastq -2 tutorial/data/SRR15168839_2.fastq"
## @PG ID:samtools PN:samtools PP:bowtie2 VN:1.15.1 CL:samtools view -H tutorial/bowtie2/SRR15168839.sam
Print SAM alignment records only:
# Output the first few lines of the file
samtools view tutorial/bowtie2/SRR15168839.sam | head
## SRR15168839.1 97 MN908947.3 16442 23 142M4I5M = 16435 -154 GCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAAAGATCGGAAG ABBBBFFFFFFFFGGGGGGGGGHHHGGGGGHHHHHHHHHHHHHHGHHHHHHHHHFHEHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHAGHHEGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGHHHHHHFHHHHHHHHHGGGD AS:i:-37 XN:i:0 XM:i:4 XO:i:1 XG:i:4 NM:i:8 MD:Z:25A116T1C0T1 YS:i:-39 YT:Z:DP
## SRR15168839.1 145 MN908947.3 16435 23 7M3I141M = 16442 154 CTTCCGATCTGCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAA ?GGGHHHHFGFHHHHHHHFHHGHGHHFHHHGGGGGHGHHHHHHHHHHHHHHEGHHHHHHHHHHHHHHHHGHHHHHHGHHHHHHHHHHHHHHHHHGGGGHHGGHGHHHHHHHHHHHHHHHHHHHHHGHHHGGGGGGGGGGFFFFFFFCCCBB AS:i:-39 XN:i:0 XM:i:5 XO:i:1 XG:i:3 NM:i:8 MD:Z:0G0G1A0T27A115 YS:i:-37 YT:Z:DP
## SRR15168839.2 97 MN908947.3 21305 42 151M = 21304 -152 GCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTCA CDDDDDDFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHGGHGGHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGFGHHHHHHHHHHHIHHHHHHHHHGHHHHHH AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:58T92 YS:i:-10 YT:Z:DP
## SRR15168839.2 145 MN908947.3 21304 42 151M = 21305 152 TGCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTC GGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHIHIHHHHGHHHHGGGHHHHHHHHHHHGHHHHHHHHHHHGGGGGGGGGGFFFFFFFBBABA AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:0C58T91 YS:i:-5 YT:Z:DP
## SRR15168839.3 99 MN908947.3 28222 42 49M1D102M = 28285 214 AGTATCATGACGTTCGTGTTGTTTTAGATTTCATCTAAACGAACAAACTAAATGTCTCTAAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTG AAAAAFFFFFFCGGGGEFGGGGHHHHGHHHHHHHHHHHHHGGGGGHHHHHHHHHHHHHHHHHHHGHHHGGGGHHHHHHHGGGGHHHHHHGGGGGGGHHGHHHGGHGGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHGHHHHHHHGGGGGH AS:i:-23 XN:i:0 XM:i:3 XO:i:1 XG:i:1 NM:i:4 MD:Z:49^A8G0A0T91 YS:i:0 YT:Z:CP
## SRR15168839.3 147 MN908947.3 28285 42 151M = 28222 -214 TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC HHGDGGGHGHHHGGGGHHHHHHGFCGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGGGGGHHHGGGGGGGGGHGGHHHHFGFEGGGGGGGHHHHHHGHHHGHGHHHGGGGGGGGGGFFBFBFFBBBBB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YS:i:-23 YT:Z:CP
## SRR15168839.4 97 MN908947.3 17888 42 151M = 17887 -152 AAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTACA BBBBBFFFFFFFGGGGDGGGGGHHGHHHFHHHFHHHHHHHHHHHHHHHHHHHHHHGHGHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHFHHFHFHHHHHHHHHF AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:107T43 YS:i:-10 YT:Z:DP
## SRR15168839.4 145 MN908947.3 17887 42 151M = 17888 152 TAAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTAC HHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGHHHHHHIIIHHFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGHGGGGGGGGGGFFFFFFFCCAAA AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:0G107T42 YS:i:-5 YT:Z:DP
## SRR15168839.5 97 MN908947.3 28286 42 151M = 28285 -152 GGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACCA BCCCCCCCCFCFGGGGGGGGGGHHHHGGGGGGGHHGHHHGGGFGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHGHHHGGGGGHHHGGGGGGGGGHHHHHHHGHHGGGGGGGGHGGHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHG AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:150G0 YS:i:0 YT:Z:DP
## SRR15168839.5 145 MN908947.3 28285 42 151M = 28286 152 TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC GHGGGGGHHHHHGGGGHHHHHHGFGGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHGHHHHHHHHHGHHHHHHHHHHHGGGGGHHHGGGGGGGGGHHGHHHHGHGGGGGGGGGHHHHHHHHHHGHHHHHGGGGGGGGGGFFDFCFFACBCA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YS:i:-5 YT:Z:DP
## samtools view: writing to standard output failed: Broken pipe
## samtools view: error closing standard output: -1
Calculate read alignment statistics:
samtools flagstat tutorial/bowtie2/SRR15168839.sam
## 2000 + 0 in total (QC-passed reads + QC-failed reads)
## 2000 + 0 primary
## 0 + 0 secondary
## 0 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 1921 + 0 mapped (96.05% : N/A)
## 1921 + 0 primary mapped (96.05% : N/A)
## 2000 + 0 paired in sequencing
## 1000 + 0 read1
## 1000 + 0 read2
## 180 + 0 properly paired (9.00% : N/A)
## 1920 + 0 with itself and mate mapped
## 1 + 0 singletons (0.05% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)
The next aligner we’re going to use is BWA - which stands for Burrows-Wheeler aligner. No prizes for guessing which indexing strategy this aligner uses! As mentioned in the introduction, BWA provides three different alignment algorithms. In practice, most people use the BWA-MEM algorithm as it is generally faster and more accurate. The general procedure for BWA is very similar to Bowtie 2 so we won’t go into as much detail about the individual steps as we did previously.
Lets begin by creating a directory to store all the BWA output files:
mkdir tutorial/bwa
As before, the next thing we need to do is index the reference sequence. To do this we use the bwa index command:
bwa index -p tutorial/bwa/MN908947 tutorial/data/MN908947.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p tutorial/bwa/MN908947 tutorial/data/MN908947.fasta
## [main] Real time: 0.007 sec; CPU: 0.025 sec
The command should print some lines of output then quit. When the command completes, the tutorial/bwa directory will contain five new files that all start with MN908947 and end with .amb, .ann, .bwt, .pac, and .sa. This constitutes the index for the BWA aligner.
Display usage information for the bwa mem command:
bwa mem
##
## Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
##
## Algorithm options:
##
## -t INT number of threads [1]
## -k INT minimum seed length [19]
## -w INT band width for banded alignment [100]
## -d INT off-diagonal X-dropoff [100]
## -r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
## -y INT seed occurrence for the 3rd round seeding [20]
## -c INT skip seeds with more than INT occurrences [500]
## -D FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
## -W INT discard a chain if seeded bases shorter than INT [0]
## -m INT perform at most INT rounds of mate rescues for each read [50]
## -S skip mate rescue
## -P skip pairing; mate rescue performed unless -S also in use
##
## Scoring options:
##
## -A INT score for a sequence match, which scales options -TdBOELU unless overridden [1]
## -B INT penalty for a mismatch [4]
## -O INT[,INT] gap open penalties for deletions and insertions [6,6]
## -E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
## -L INT[,INT] penalty for 5'- and 3'-end clipping [5,5]
## -U INT penalty for an unpaired read pair [17]
##
## -x STR read type. Setting -x changes multiple parameters unless overridden [null]
## pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
## ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
## intractg: -B9 -O16 -L5 (intra-species contigs to ref)
##
## Input/output options:
##
## -p smart pairing (ignoring in2.fq)
## -R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
## -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]
## -o FILE sam file to output results to [stdout]
## -j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
## -5 for split alignment, take the alignment with the smallest coordinate as primary
## -q don't modify mapQ of supplementary alignments
## -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []
##
## -v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [3]
## -T INT minimum score to output [30]
## -h INT[,INT] if there are <INT hits with score >80% of the max score, output all in XA [5,200]
## -a output all alignments for SE or unpaired PE
## -C append FASTA/FASTQ comment to SAM output
## -V output the reference FASTA header in the XR tag
## -Y use soft clipping for supplementary alignments
## -M mark shorter split hits as secondary
##
## -I FLOAT[,FLOAT[,INT[,INT]]]
## specify the mean, standard deviation (10% of the mean if absent), max
## (4 sigma from the mean if absent) and min of the insert size distribution.
## FR orientation only. [inferred]
##
## Note: Please read the man page for detailed description of the command line and options.
Run the BWA alignment command:
bwa mem tutorial/bwa/MN908947 tutorial/data/SRR15168839_1.fastq tutorial/data/SRR15168839_2.fastq > tutorial/bwa/SRR15168839.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 2000 sequences (302000 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 996, 0, 0)
## [M::mem_pestat] skip orientation FF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (142, 148, 149)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (128, 163)
## [M::mem_pestat] mean and std.dev: (145.46, 5.30)
## [M::mem_pestat] low and high boundaries for proper pairs: (121, 170)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] skip orientation RR as there are not enough pairs
## [M::mem_process_seqs] Processed 2000 reads in 0.049 CPU sec, 0.049 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem tutorial/bwa/MN908947 tutorial/data/SRR15168839_1.fastq tutorial/data/SRR15168839_2.fastq
## [main] Real time: 0.054 sec; CPU: 0.057 sec
Print SAM header only:
samtools view -H tutorial/bwa/SRR15168839.sam
## @SQ SN:MN908947.3 LN:29903
## @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem tutorial/bwa/MN908947 tutorial/data/SRR15168839_1.fastq tutorial/data/SRR15168839_2.fastq
## @PG ID:samtools PN:samtools PP:bwa VN:1.15.1 CL:samtools view -H tutorial/bwa/SRR15168839.sam
Print SAM alignment records only:
# Output the first few lines of the file
samtools view tutorial/bwa/SRR15168839.sam | head
## SRR15168839.1 99 MN908947.3 16442 60 142M9S = 16442 141 GCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAAAGATCGGAAG ABBBBFFFFFFFFGGGGGGGGGHHHGGGGGHHHHHHHHHHHHHHGHHHHHHHHHFHEHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHAGHHEGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGHHHHHHFHHHHHHHHHGGGD NM:i:1 MD:Z:25A116 MC:Z:10S141M AS:i:137 XS:i:0
## SRR15168839.1 147 MN908947.3 16442 60 10S141M = 16442 -141 CTTCCGATCTGCTATTATTGTAAATCACATAAACCGCCCATTAGTTTTCCATTGTGTGCTAATGGACAAGTTTTTGGTTTATATAAAAATACATGTGTTGGTAGCGATAATGTTACTGACTTTAATGCAATTGCAACATGTGACTGGACAA ?GGGHHHHFGFHHHHHHHFHHGHGHHFHHHGGGGGHGHHHHHHHHHHHHHHEGHHHHHHHHHHHHHHHHGHHHHHHGHHHHHHHHHHHHHHHHHGGGGHHGGHGHHHHHHHHHHHHHHHHHHHHHGHHHGGGGGGGGGGFFFFFFFCCCBB NM:i:1 MD:Z:25A115 MC:Z:142M9S AS:i:136 XS:i:0
## SRR15168839.2 99 MN908947.3 21305 60 151M = 21304 150 GCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTCA CDDDDDDFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHGGHGGHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGFGHHHHHHHHHHHIHHHHHHHHHGHHHHHH NM:i:1 MD:Z:58T92 MC:Z:151M AS:i:146 XS:i:0
## SRR15168839.2 147 MN908947.3 21304 60 151M = 21305 -150 TGCGAACAAATAGATGGTTATGTCATGCATGCAAATTACATATTTTGGAGGAATACAAACCCAATTCAGTTGTCTTCCTATTCTTTATTTGACATGAGTAAATTTCCCCTTAAATTAAGGGGTACTGCTGTTATGTCTTTAAAAGAAGGTC GGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHIHIHHHHGHHHHGGGHHHHHHHHHHHGHHHHHHHHHHHGGGGGGGGGGFFFFFFFBBABA NM:i:2 MD:Z:0C58T91 MC:Z:151M AS:i:145 XS:i:0
## SRR15168839.3 97 MN908947.3 28222 60 49M1D102M = 28285 214 AGTATCATGACGTTCGTGTTGTTTTAGATTTCATCTAAACGAACAAACTAAATGTCTCTAAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTG AAAAAFFFFFFCGGGGEFGGGGHHHHGHHHHHHHHHHHHHGGGGGHHHHHHHHHHHHHHHHHHHGHHHGGGGHHHHHHHGGGGHHHHHHGGGGGGGHHGHHHGGHGGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHGHHHHHHHGGGGGH NM:i:4 MD:Z:49^A8G0A0T91 MC:Z:151M AS:i:129 XS:i:0
## SRR15168839.3 145 MN908947.3 28285 60 151M = 28222 -214 TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC HHGDGGGHGHHHGGGGHHHHHHGFCGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGGGGGHHHGGGGGGGGGHGGHHHHFGFEGGGGGGGHHHHHHGHHHGHGHHHGGGGGGGGGGFFBFBFFBBBBB NM:i:0 MD:Z:151 MC:Z:49M1D102M AS:i:151 XS:i:0
## SRR15168839.4 99 MN908947.3 17888 60 151M = 17887 150 AAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTACA BBBBBFFFFFFFGGGGDGGGGGHHGHHHFHHHFHHHHHHHHHHHHHHHHHHHHHHGHGHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHFHHFHFHHHHHHHHHF NM:i:1 MD:Z:107T43 MC:Z:151M AS:i:146 XS:i:0
## SRR15168839.4 147 MN908947.3 17887 60 151M = 17888 -150 TAAACAGCTCACTCTTGTAATGTAAACAGATTTAATGTTGCTATTACCAGAGCAAAAGTAGGCATACTTTGCATAATGTCTGATAGAGACCTTTATGACAAGTTGCAACTTACAAGTCTTGAAATTCCACGTAGGAATGTGGCAACTTTAC HHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGHHHHHHIIIHHFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGHGGGGGGGGGGFFFFFFFCCAAA NM:i:2 MD:Z:0G107T42 MC:Z:151M AS:i:145 XS:i:0
## SRR15168839.5 99 MN908947.3 28286 60 151M = 28285 150 GGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACCA BCCCCCCCCFCFGGGGGGGGGGHHHHGGGGGGGHHGHHHGGGFGHHHGHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHGHHHGGGGGHHHGGGGGGGGGHHHHHHHGHHGGGGGGGGHGGHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHG NM:i:1 MD:Z:150G0 MC:Z:151M AS:i:150 XS:i:0
## SRR15168839.5 147 MN908947.3 28285 60 151M = 28286 -150 TGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAACTGGCAGTAACCAGAATGGAGAACGCAGTGGGGCGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGCGTCTTGGTTCACC GHGGGGGHHHHHGGGGHHHHHHGFGGGGGHHGHHHGHHHHHHHHHGGGHHHHHHHHGHHHHHHHHHGHHHHHHHHHHHGGGGGHHHGGGGGGGGGHHGHHHHGHGGGGGGGGGHHHHHHHHHHGHHHHHGGGGGGGGGGFFDFCFFACBCA NM:i:0 MD:Z:151 MC:Z:151M AS:i:151 XS:i:0
## samtools view: writing to standard output failed: Broken pipe
## samtools view: error closing standard output: -1
Calculate read alignment statistics:
samtools flagstat tutorial/bwa/SRR15168839.sam
## 2006 + 0 in total (QC-passed reads + QC-failed reads)
## 2000 + 0 primary
## 0 + 0 secondary
## 6 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 1991 + 0 mapped (99.25% : N/A)
## 1985 + 0 primary mapped (99.25% : N/A)
## 2000 + 0 paired in sequencing
## 1000 + 0 read1
## 1000 + 0 read2
## 1864 + 0 properly paired (93.20% : N/A)
## 1982 + 0 with itself and mate mapped
## 3 + 0 singletons (0.15% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)
Before we finish, one thing you might have noticed is the difference in ‘properly paired’ alignment rates between the Bowtie 2 and BWA aligners. Unfortunately, this is not simple to explain. The ‘properly paired’ definition is not standard and each aligner uses different rules to define whether a pair of reads is properly paired or not. To know why these rates are different you will have to look up the Bowtie 2 and BWA definitions. We’ll leave that as a puzzle you to solve, if you’re curious.
The exercises below are designed to strengthen your knowledge of using Bowtie 2 and BWA to align short reads to a reference genome. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.
Create a directory to store the output files from each exercise:
mkdir exercises
mkdir exercises/ebola
mkdir exercises/hiv
mkdir exercises/dengue
Ebola virus disease (EVD), formerly known as Ebola haemorrhagic fever, is a severe, often fatal illness affecting humans and other primates. The virus is transmitted to people from wild animals (such as fruit bats, porcupines and non-human primates) and then spreads in the human population through direct contact with the blood, secretions, organs or other bodily fluids of infected people, and with surfaces and materials (e.g. bedding, clothing) contaminated with these fluids.
efetch command from EDirect to download and save the Ebola reference genome (AF086833):efetch -db nuccore -id AF086833 -format fasta > exercises/ebola/AF086833.fasta
fastq-dump command from the SRA Toolkit to download 10,000 paired-end reads from a sequencing run (SRR1972739) of an Ebola sample:fastq-dump -X 10000 -O exercises/ebola/ --split-files SRR1972739
## Read 10000 spots for SRR1972739
## Written 10000 spots for SRR1972739
seqkit fx2tab -n -g -l exercises/ebola/AF086833.fasta
## AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome 18959 41.07
bwa index -p exercises/ebola/AF086833 exercises/ebola/AF086833.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p exercises/ebola/AF086833 exercises/ebola/AF086833.fasta
## [main] Real time: 0.005 sec; CPU: 0.022 sec
bwa mem exercises/ebola/AF086833 exercises/ebola/SRR1972739_1.fastq exercises/ebola/SRR1972739_2.fastq > exercises/ebola/SRR1972739.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 20000 sequences (2020000 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (738, 5739, 6, 663)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (74, 114, 180)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 392)
## [M::mem_pestat] mean and std.dev: (131.35, 75.24)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 498)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (174, 220, 276)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 480)
## [M::mem_pestat] mean and std.dev: (227.70, 77.37)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 582)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (77, 122, 175)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 371)
## [M::mem_pestat] mean and std.dev: (131.18, 71.99)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 469)
## [M::mem_process_seqs] Processed 20000 reads in 1.011 CPU sec, 1.011 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/ebola/AF086833 exercises/ebola/SRR1972739_1.fastq exercises/ebola/SRR1972739_2.fastq
## [main] Real time: 1.030 sec; CPU: 1.031 sec
samtools flagstat command to report the alignment statistics:samtools flagstat exercises/ebola/SRR1972739.sam
## 20740 + 0 in total (QC-passed reads + QC-failed reads)
## 20000 + 0 primary
## 0 + 0 secondary
## 740 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 15279 + 0 mapped (73.67% : N/A)
## 14539 + 0 primary mapped (72.69% : N/A)
## 20000 + 0 paired in sequencing
## 10000 + 0 read1
## 10000 + 0 read2
## 14480 + 0 properly paired (72.40% : N/A)
## 14528 + 0 with itself and mate mapped
## 11 + 0 singletons (0.05% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)
Human immunodeficiency virus (HIV) is an infection that attacks the body’s immune system, specifically the white blood cells called CD4 cells. HIV destroys these CD4 cells, weakening a person’s immunity against opportunistic infections, such as tuberculosis and fungal infections, severe bacterial infections and some cancers.
efetch command from EDirect to download and save the HIV reference genome (AF033819):efetch -db nuccore -id AF033819 -format fasta > exercises/hiv/AF033819.fasta
esearch and efetch commands to download and save the run information from a public HIV sequencing project (PRJNA541016):esearch -db sra -query PRJNA541016 | efetch -format runinfo > exercises/hiv/runinfo.csv
cut -d ',' -f 1 exercises/hiv/runinfo.csv | tail -n +2 | head -n 5 > exercises/hiv/runids.txt
fastq-dump command to download 1000 single-end reads from each of the runs listed in the project. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3:while read RUN; do
fastq-dump -X 1000 -O exercises/hiv ${RUN}
done < exercises/hiv/runids.txt
## Read 1000 spots for SRR9008461
## Written 1000 spots for SRR9008461
## Read 1000 spots for SRR9008494
## Written 1000 spots for SRR9008494
## Read 1000 spots for SRR9008493
## Written 1000 spots for SRR9008493
## Read 1000 spots for SRR9008492
## Written 1000 spots for SRR9008492
## Read 1000 spots for SRR9008491
## Written 1000 spots for SRR9008491
bowtie2-build exercises/hiv/AF033819.fasta exercises/hiv/AF033819
## Settings:
## Output files: "exercises/hiv/AF033819.*.bt2"
## Line rate: 6 (line is 64 bytes)
## Lines per side: 1 (side is 64 bytes)
## Offset rate: 4 (one in 16)
## FTable chars: 10
## Strings: unpacked
## Max bucket size: default
## Max bucket size, sqrt multiplier: default
## Max bucket size, len divisor: 4
## Difference-cover sample period: 1024
## Endianness: little
## Actual local endianness: little
## Sanity checking: disabled
## Assertions: disabled
## Random seed: 0
## Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
## exercises/hiv/AF033819.fasta
## Building a SMALL index
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 2295
## Using parameters --bmax 1722 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 1722 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 2, merged 7; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 1146.75 (target: 1721)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
## Reserving size (1722) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 170 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 171 for bucket 1
## Getting block 2 of 8
## Reserving size (1722) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 1666 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1667 for bucket 2
## Getting block 3 of 8
## Reserving size (1722) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 1066 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1067 for bucket 3
## Getting block 4 of 8
## Reserving size (1722) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 1122 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1123 for bucket 4
## Getting block 5 of 8
## Reserving size (1722) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 1351 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1352 for bucket 5
## Getting block 6 of 8
## Reserving size (1722) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 1442 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1443 for bucket 6
## Getting block 7 of 8
## Reserving size (1722) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 1641 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1642 for bucket 7
## Getting block 8 of 8
## Reserving size (1722) for bucket 8
## Calculating Z arrays for bucket 8
## Entering block accumulator loop for bucket 8:
## bucket 8: 10%
## bucket 8: 20%
## bucket 8: 30%
## bucket 8: 40%
## bucket 8: 50%
## bucket 8: 60%
## bucket 8: 70%
## bucket 8: 80%
## bucket 8: 90%
## bucket 8: 100%
## Sorting block of length 716 for bucket 8
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 717 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3272
## fchr[G]: 4914
## fchr[T]: 7139
## fchr[$]: 9181
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4197571 bytes to primary EBWT file: exercises/hiv/AF033819.1.bt2.tmp
## Wrote 2300 bytes to secondary EBWT file: exercises/hiv/AF033819.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 9181
## bwtLen: 9182
## sz: 2296
## bwtSz: 2296
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 574
## offsSz: 2296
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 48
## numLines: 48
## ebwtTotLen: 3072
## ebwtTotSz: 3072
## color: 0
## reverse: 0
## Total time for call to driver() for forward index: 00:00:00
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 2295
## Using parameters --bmax 1722 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 1722 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 6; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 1146.75 (target: 1721)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
## Reserving size (1722) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 1457 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1458 for bucket 1
## Getting block 2 of 8
## Reserving size (1722) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 282 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 283 for bucket 2
## Getting block 3 of 8
## Reserving size (1722) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 1451 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1452 for bucket 3
## Getting block 4 of 8
## Reserving size (1722) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 652 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 653 for bucket 4
## Getting block 5 of 8
## Reserving size (1722) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 1423 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1424 for bucket 5
## Getting block 6 of 8
## Reserving size (1722) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 906 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 907 for bucket 6
## Getting block 7 of 8
## Reserving size (1722) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 1721 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1722 for bucket 7
## Getting block 8 of 8
## Reserving size (1722) for bucket 8
## Calculating Z arrays for bucket 8
## Entering block accumulator loop for bucket 8:
## bucket 8: 10%
## bucket 8: 20%
## bucket 8: 30%
## bucket 8: 40%
## bucket 8: 50%
## bucket 8: 60%
## bucket 8: 70%
## bucket 8: 80%
## bucket 8: 90%
## bucket 8: 100%
## Sorting block of length 1282 for bucket 8
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1283 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3272
## fchr[G]: 4914
## fchr[T]: 7139
## fchr[$]: 9181
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4197571 bytes to primary EBWT file: exercises/hiv/AF033819.rev.1.bt2.tmp
## Wrote 2300 bytes to secondary EBWT file: exercises/hiv/AF033819.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 9181
## bwtLen: 9182
## sz: 2296
## bwtSz: 2296
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 574
## offsSz: 2296
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 48
## numLines: 48
## ebwtTotLen: 3072
## ebwtTotSz: 3072
## color: 0
## reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:00
## Renaming exercises/hiv/AF033819.3.bt2.tmp to exercises/hiv/AF033819.3.bt2
## Renaming exercises/hiv/AF033819.4.bt2.tmp to exercises/hiv/AF033819.4.bt2
## Renaming exercises/hiv/AF033819.1.bt2.tmp to exercises/hiv/AF033819.1.bt2
## Renaming exercises/hiv/AF033819.2.bt2.tmp to exercises/hiv/AF033819.2.bt2
## Renaming exercises/hiv/AF033819.rev.1.bt2.tmp to exercises/hiv/AF033819.rev.1.bt2
## Renaming exercises/hiv/AF033819.rev.2.bt2.tmp to exercises/hiv/AF033819.rev.2.bt2
while read RUN; do
bowtie2 --very-sensitive -x exercises/hiv/AF033819 -U exercises/hiv/${RUN}.fastq > exercises/hiv/${RUN}.sam
done < exercises/hiv/runids.txt
## 1000 reads; of these:
## 1000 (100.00%) were unpaired; of these:
## 740 (74.00%) aligned 0 times
## 260 (26.00%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 26.00% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were unpaired; of these:
## 162 (16.20%) aligned 0 times
## 838 (83.80%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 83.80% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were unpaired; of these:
## 52 (5.20%) aligned 0 times
## 948 (94.80%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 94.80% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were unpaired; of these:
## 355 (35.50%) aligned 0 times
## 645 (64.50%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 64.50% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were unpaired; of these:
## 234 (23.40%) aligned 0 times
## 765 (76.50%) aligned exactly 1 time
## 1 (0.10%) aligned >1 times
## 76.60% overall alignment rate
while read RUN; do
samtools flagstat exercises/hiv/${RUN}.sam > exercises/hiv/${RUN}_flagstat.txt
done < exercises/hiv/runids.txt
grep to find and report the number of mapped reads for each run:# limit to first match (-m 1)
grep -m 1 "mapped" exercises/hiv/*_flagstat.txt
## exercises/hiv/SRR9008461_flagstat.txt:260 + 0 mapped (26.00% : N/A)
## exercises/hiv/SRR9008491_flagstat.txt:766 + 0 mapped (76.60% : N/A)
## exercises/hiv/SRR9008492_flagstat.txt:645 + 0 mapped (64.50% : N/A)
## exercises/hiv/SRR9008493_flagstat.txt:948 + 0 mapped (94.80% : N/A)
## exercises/hiv/SRR9008494_flagstat.txt:838 + 0 mapped (83.80% : N/A)
Dengue is a mosquito-borne viral infection that is common in warm, tropical climates. Infection is caused by any one of four closely related dengue viruses (called serotypes) and these can lead to a wide spectrum of symptoms, including some which are extremely mild (unnoticeable) to those that may require medical intervention and hospitalization. In severe cases, fatalities can occur. There is no treatment for the infection itself but the symptoms that a patient experiences can be managed.
efetch command from EDirect to download and save the Dengue reference genome (NC_001477):efetch -db nuccore -id NC_001477 -format fasta > exercises/dengue/NC_001477.fasta
esearch and efetch commands to download and save the run information from a public Dengue sequencing project (PRJNA494391):esearch -db sra -query PRJNA494391 | efetch -format runinfo > exercises/dengue/runinfo.csv
tail -n +2 exercises/dengue/runinfo.csv | grep "Dengue virus" | awk -F "," '($4 > 1500000)' | cut -d "," -f 1 > exercises/dengue/runids.txt
fastq-dump command to download 100,000 paired-end reads from each of the runs listed in the project. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3:while read RUN; do
fastq-dump -X 100000 -O exercises/dengue --split-files ${RUN}
done < exercises/dengue/runids.txt
## Read 100000 spots for SRR8069228
## Written 100000 spots for SRR8069228
## Read 100000 spots for SRR8069229
## Written 100000 spots for SRR8069229
bwa index -p exercises/dengue/NC_001477 exercises/dengue/NC_001477.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p exercises/dengue/NC_001477 exercises/dengue/NC_001477.fasta
## [main] Real time: 0.004 sec; CPU: 0.021 sec
bowtie2-build exercises/dengue/NC_001477.fasta exercises/dengue/NC_001477
## Settings:
## Output files: "exercises/dengue/NC_001477.*.bt2"
## Line rate: 6 (line is 64 bytes)
## Lines per side: 1 (side is 64 bytes)
## Offset rate: 4 (one in 16)
## FTable chars: 10
## Strings: unpacked
## Max bucket size: default
## Max bucket size, sqrt multiplier: default
## Max bucket size, len divisor: 4
## Difference-cover sample period: 1024
## Endianness: little
## Actual local endianness: little
## Sanity checking: disabled
## Assertions: disabled
## Random seed: 0
## Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
## exercises/dengue/NC_001477.fasta
## Building a SMALL index
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 2683
## Using parameters --bmax 2013 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 2013 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 2, merged 7; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 2; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Avg bucket size: 1532.71 (target: 2012)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 7
## Reserving size (2013) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 1449 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1450 for bucket 1
## Getting block 2 of 7
## Reserving size (2013) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 1668 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1669 for bucket 2
## Getting block 3 of 7
## Reserving size (2013) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 1970 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1971 for bucket 3
## Getting block 4 of 7
## Reserving size (2013) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 2005 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 2006 for bucket 4
## Getting block 5 of 7
## Reserving size (2013) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 1766 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1767 for bucket 5
## Getting block 6 of 7
## Reserving size (2013) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 1080 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1081 for bucket 6
## Getting block 7 of 7
## Reserving size (2013) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 791 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 792 for bucket 7
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3426
## fchr[G]: 5666
## fchr[T]: 8436
## fchr[$]: 10735
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4198093 bytes to primary EBWT file: exercises/dengue/NC_001477.1.bt2.tmp
## Wrote 2688 bytes to secondary EBWT file: exercises/dengue/NC_001477.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 10735
## bwtLen: 10736
## sz: 2684
## bwtSz: 2684
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 671
## offsSz: 2684
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 56
## numLines: 56
## ebwtTotLen: 3584
## ebwtTotSz: 3584
## color: 0
## reverse: 0
## Total time for call to driver() for forward index: 00:00:00
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 2683
## Using parameters --bmax 2013 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 2013 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 1532.71 (target: 2012)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 7
## Reserving size (2013) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 1082 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1083 for bucket 1
## Getting block 2 of 7
## Reserving size (2013) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 1956 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1957 for bucket 2
## Getting block 3 of 7
## Reserving size (2013) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 2009 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 2010 for bucket 3
## Getting block 4 of 7
## Reserving size (2013) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 1882 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1883 for bucket 4
## Getting block 5 of 7
## Reserving size (2013) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 1056 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1057 for bucket 5
## Getting block 6 of 7
## Reserving size (2013) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 1079 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1080 for bucket 6
## Getting block 7 of 7
## Reserving size (2013) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 1665 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 1666 for bucket 7
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 3426
## fchr[G]: 5666
## fchr[T]: 8436
## fchr[$]: 10735
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 4198093 bytes to primary EBWT file: exercises/dengue/NC_001477.rev.1.bt2.tmp
## Wrote 2688 bytes to secondary EBWT file: exercises/dengue/NC_001477.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 10735
## bwtLen: 10736
## sz: 2684
## bwtSz: 2684
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 671
## offsSz: 2684
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 56
## numLines: 56
## ebwtTotLen: 3584
## ebwtTotSz: 3584
## color: 0
## reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:00
## Renaming exercises/dengue/NC_001477.3.bt2.tmp to exercises/dengue/NC_001477.3.bt2
## Renaming exercises/dengue/NC_001477.4.bt2.tmp to exercises/dengue/NC_001477.4.bt2
## Renaming exercises/dengue/NC_001477.1.bt2.tmp to exercises/dengue/NC_001477.1.bt2
## Renaming exercises/dengue/NC_001477.2.bt2.tmp to exercises/dengue/NC_001477.2.bt2
## Renaming exercises/dengue/NC_001477.rev.1.bt2.tmp to exercises/dengue/NC_001477.rev.1.bt2
## Renaming exercises/dengue/NC_001477.rev.2.bt2.tmp to exercises/dengue/NC_001477.rev.2.bt2
while read RUN; do
bwa mem exercises/dengue/NC_001477 exercises/dengue/${RUN}_1.fastq exercises/dengue/${RUN}_2.fastq > exercises/dengue/${RUN}_bwa.sam
bowtie2 -x exercises/dengue/NC_001477 -1 exercises/dengue/${RUN}_1.fastq -2 exercises/dengue/${RUN}_2.fastq > exercises/dengue/${RUN}_bowtie2.sam
done < exercises/dengue/runids.txt
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 68646 sequences (10000256 bp)...
## [M::process] read 68728 sequences (10000112 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7649, 8664, 2, 7447)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 42, 70)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 170)
## [M::mem_pestat] mean and std.dev: (46.94, 31.55)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 220)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (217, 265, 306)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (39, 484)
## [M::mem_pestat] mean and std.dev: (261.49, 69.92)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 573)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (19, 40, 66)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 160)
## [M::mem_pestat] mean and std.dev: (44.56, 30.72)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 207)
## [M::mem_process_seqs] Processed 68646 reads in 7.466 CPU sec, 7.433 real sec
## [M::process] read 62626 sequences (9120396 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7560, 8655, 2, 7568)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 42, 70)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 170)
## [M::mem_pestat] mean and std.dev: (46.71, 31.48)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 220)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (215, 265, 306)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (33, 488)
## [M::mem_pestat] mean and std.dev: (261.58, 71.85)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 579)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (18, 39, 66)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 162)
## [M::mem_pestat] mean and std.dev: (44.19, 30.65)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 210)
## [M::mem_process_seqs] Processed 68728 reads in 7.567 CPU sec, 7.505 real sec
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (6874, 8110, 3, 6750)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 42, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 167)
## [M::mem_pestat] mean and std.dev: (46.73, 31.55)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 216)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (211, 263, 305)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (23, 493)
## [M::mem_pestat] mean and std.dev: (259.36, 72.29)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 587)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (17, 39, 65)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 161)
## [M::mem_pestat] mean and std.dev: (43.53, 30.64)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 209)
## [M::mem_process_seqs] Processed 62626 reads in 6.920 CPU sec, 6.898 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/dengue/NC_001477 exercises/dengue/SRR8069228_1.fastq exercises/dengue/SRR8069228_2.fastq
## [main] Real time: 21.909 sec; CPU: 22.027 sec
## 100000 reads; of these:
## 100000 (100.00%) were paired; of these:
## 76891 (76.89%) aligned concordantly 0 times
## 23109 (23.11%) aligned concordantly exactly 1 time
## 0 (0.00%) aligned concordantly >1 times
## ----
## 76891 pairs aligned concordantly 0 times; of these:
## 38413 (49.96%) aligned discordantly 1 time
## ----
## 38478 pairs aligned 0 times concordantly or discordantly; of these:
## 76956 mates make up the pairs; of these:
## 56759 (73.76%) aligned 0 times
## 20197 (26.24%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 71.62% overall alignment rate
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 68464 sequences (10000143 bp)...
## [M::process] read 68462 sequences (10000296 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7349, 8761, 7, 7228)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (20, 40, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 167)
## [M::mem_pestat] mean and std.dev: (45.96, 31.57)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 216)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (217, 269, 309)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (33, 493)
## [M::mem_pestat] mean and std.dev: (264.46, 72.46)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 585)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (18, 39, 66)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 162)
## [M::mem_pestat] mean and std.dev: (44.24, 30.99)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 210)
## [M::mem_process_seqs] Processed 68464 reads in 7.433 CPU sec, 7.400 real sec
## [M::process] read 63074 sequences (9221933 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (7477, 8691, 2, 7439)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (19, 41, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 169)
## [M::mem_pestat] mean and std.dev: (46.10, 31.59)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 219)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (218, 269, 309)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (36, 491)
## [M::mem_pestat] mean and std.dev: (265.25, 71.67)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 582)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (18, 39, 65)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 159)
## [M::mem_pestat] mean and std.dev: (44.12, 30.62)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 206)
## [M::mem_process_seqs] Processed 68462 reads in 7.474 CPU sec, 7.410 real sec
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (6753, 8274, 1, 6821)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (19, 41, 69)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 169)
## [M::mem_pestat] mean and std.dev: (46.17, 32.14)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 219)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (211, 264, 304)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (25, 490)
## [M::mem_pestat] mean and std.dev: (259.47, 72.16)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 583)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (17, 38, 65)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 161)
## [M::mem_pestat] mean and std.dev: (43.31, 30.89)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 209)
## [M::mem_process_seqs] Processed 63074 reads in 6.931 CPU sec, 6.911 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/dengue/NC_001477 exercises/dengue/SRR8069229_1.fastq exercises/dengue/SRR8069229_2.fastq
## [main] Real time: 21.806 sec; CPU: 21.919 sec
## 100000 reads; of these:
## 100000 (100.00%) were paired; of these:
## 76672 (76.67%) aligned concordantly 0 times
## 23328 (23.33%) aligned concordantly exactly 1 time
## 0 (0.00%) aligned concordantly >1 times
## ----
## 76672 pairs aligned concordantly 0 times; of these:
## 37483 (48.89%) aligned discordantly 1 time
## ----
## 39189 pairs aligned 0 times concordantly or discordantly; of these:
## 78378 mates make up the pairs; of these:
## 59066 (75.36%) aligned 0 times
## 19312 (24.64%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 70.47% overall alignment rate
while read RUN; do
samtools flagstat exercises/dengue/${RUN}_bwa.sam > exercises/dengue/${RUN}_bwa_flagstat.txt
samtools flagstat exercises/dengue/${RUN}_bowtie2.sam > exercises/dengue/${RUN}_bowtie2_flagstat.txt
done < exercises/dengue/runids.txt
grep to find and report the number of mapped reads for each run using the BWA aligner:# limit to first match (-m 1)
grep -m 1 "mapped" exercises/dengue/*_bwa_flagstat.txt
## exercises/dengue/SRR8069228_bwa_flagstat.txt:178963 + 0 mapped (89.33% : N/A)
## exercises/dengue/SRR8069229_bwa_flagstat.txt:176126 + 0 mapped (87.93% : N/A)
grep to find and report the number of mapped reads for each run using the Bowtie 2 aligner:# limit to first match (-m 1)
grep -m 1 "mapped" exercises/dengue/*_bowtie2_flagstat.txt
## exercises/dengue/SRR8069228_bowtie2_flagstat.txt:143241 + 0 mapped (71.62% : N/A)
## exercises/dengue/SRR8069229_bowtie2_flagstat.txt:140934 + 0 mapped (70.47% : N/A)